Abstract

We consider the problem of online combinatorial optimization under semi-bandit feedback, where 
a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize 
the total losses associated with its decisions. After making each decision, the learner observes the 
losses associated with its action, but not other losses. For this problem, there are several learning 
algorithms that guarantee that the learner’s expected regret grows as $\tilde O(\sqrt{T})$ with the number of
rounds $T$. In this paper, we propose an algorithm that improves this scaling to $\tilde O(\sqrt{L_T^*})$, where
$L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a
partial-feedback scheme, and the first one to do so in a combinatorial setting.

Keywords: online learning, online combinatorial optimization, semi-bandit feedback, follow the 
perturbed leader, improvements for small losses, first-order bounds


1. Introduction 

Consider the problem of sequential multi-user channel allocation in a cognitive radio network (see, 
e.g., Gai et al., 2012). In this problem, a network operator sequentially matches a set of N secondary
users to a set of M channels, with the goal of maximizing the overall quality of service (QoS) 
provided for the secondary users, while not interfering with the quality provided to primary users. 
Due to different QoS preferences of users and geographic dispersion, different users might perceive 
the quality of the same channel differently. Furthermore, due to uneven traffic on the channels 
and other external conditions, the quality of each matching may change over time in a way that is 
very difficult to model by statistical assumptions. Formally, the loss associated with user $i$ being
matched to channel $j$ in the $t$-th decision-making round is $\ell_{t,(ij)} \in [0,1]$, and the goal of the network
operator is to sequentially select matchings $V_t$ so as to minimize its total loss $\sum_{t=1}^T \sum_{(ij) \in V_t} \ell_{t,(ij)}$
after T rounds. It is realistic to assume that the operator learns about the instantaneous losses of the 
allocated user-channel pairs after making each decision, but counterfactual losses are never revealed. 

Among many other sequential optimization problems of practical interest such as sequential 
routing or online advertising, the above problem can be formulated in the general framework of 
online combinatorial optimization (Audibert et al., 2014). This learning problem can be formalized
as a repeated game between a learner and an environment. In every round $t = 1, 2, \dots, T$, the
learner picks a decision $V_t$ from a combinatorial decision set $S \subseteq \{0,1\}^d$. Simultaneously, the
environment fixes a loss vector $\ell_t \in [0,1]^d$ and the learner suffers a loss of $V_t^\top \ell_t$. We assume that
$\|v\|_1 \le m$ holds for all $v \in S$, entailing $V_t^\top \ell_t \le m$. At the end of the round, the learner observes
some feedback based on $V_t$ and $\ell_t$. The simplest setting imaginable is called the full-information
setting, where the learner observes the entire loss vector $\ell_t$. In most practical situations, however, the
learner cannot expect such rich feedback. In this paper, we focus on a more realistic and challenging
feedback scheme known as semi-bandit: here the learner observes the subset of components $\ell_{t,i}$ of
the loss vector with $V_{t,i} = 1$. Note that this precise feedback scheme arises in our cognitive-radio
example. The performance of the learner is measured in terms of the regret

$$R_T = \max_{v \in S} \sum_{t=1}^T (V_t - v)^\top \ell_t,$$

that is, the gap between the total loss of the learner and that of the best fixed action. The interaction
history up to time $t$ is captured by $\mathcal{F}_{t-1} = \sigma(V_1, \dots, V_{t-1})$. In the current paper, we focus on
oblivious environments that are only allowed to pick each loss vector $\ell_t$ independently of $\mathcal{F}_{t-1}$.
The learner is allowed to (and, by standard arguments, should) randomize its decision $V_t$ based on
the observation history $\mathcal{F}_{t-1}$. With these remarks in mind, we will focus on the expected regret
$\mathbb{E}[R_T]$ from now on, where the expectation integrates over the randomness injected by the learner.
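
As a concrete illustration of the regret defined above, the following minimal sketch computes $R_T$ by brute-force enumeration on a small decision set. The choice of $S$ as the set of all binary vectors with exactly $m$ ones, and the helper names `m_sets` and `regret`, are assumptions made for illustration only and do not come from the paper.

```python
import itertools
import numpy as np

def m_sets(d, m):
    """All binary vectors in {0,1}^d with exactly m ones (a hypothetical decision set S)."""
    actions = []
    for idx in itertools.combinations(range(d), m):
        v = np.zeros(d)
        v[list(idx)] = 1.0
        actions.append(v)
    return np.array(actions)

def regret(played, losses, S):
    """R_T = max_{v in S} sum_t (V_t - v)^T l_t for played actions V_t and loss vectors l_t."""
    learner_loss = float(np.sum(played * losses))  # sum_t V_t^T l_t
    cumulative = losses.sum(axis=0)                # cumulative loss of each component
    best_loss = float(np.min(S @ cumulative))      # total loss of the best fixed action
    return learner_loss - best_loss
```

Enumerating $S$ is only feasible for tiny instances; it is used here purely to make the definition tangible.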

Most of the literature is concerned with finding algorithms for the learner that guarantee that 
the regret grows as slowly as possible with T. Of equal importance is establishing lower bounds 
on the learner’s regret against specific classes of environments. Both of these questions are by 
now very well-studied, especially in the simple case where S is the set of d-dimensional unit 
vectors; this setting is known as prediction with expert advice when considering full feedback 
(e.g., Cesa-Bianchi and Lugosi, 2006) and the multi-armed bandit problem when considering semi-bandit
feedback (e.g., Auer et al., 2002a). In these settings, the minimax regret is known to be of
$\Theta(\sqrt{T \log d})$ and $\Theta(\sqrt{dT})$, respectively. Several learning algorithms are known to achieve these
regret bounds, at least up to logarithmic factors in the bandit case; the notable exception is
the PolyINF algorithm proposed by Audibert and Bubeck (2009), which attains the minimax rate
without the extra logarithmic factor. The minimax regret for the
general combinatorial setting was studied by Audibert et al. (2014), who show that no algorithm
can achieve better regret than $\Omega(m\sqrt{T \log(d/m)})$ in the full-information setting, or $\Omega(\sqrt{mdT})$
in the semi-bandit setting. Audibert et al. also propose algorithms that achieve these guarantees
under both of the above feedback schemes. Furthermore, they show that a natural (although not
always efficient) extension of the Exp3 strategy of Auer et al. (2002a) guarantees a regret bound
of $O(m\sqrt{dT \log(d/m)})$ in the semi-bandit setting (see also György et al., 2007). A computationally
efficient strategy for the same setting was proposed by Neu and Bartok (2013), who show
that an augmented version of the FPL algorithm of Kalai and Vempala (2005) achieves a regret of
$O(m\sqrt{dT \log d})$, essentially matching the bound of Exp3.

Even though the above guarantees cannot be substantially improved under the worst possible 
realization of the loss sequence, certain improvements are possible for specific types of loss sequences.
Arguably, one of the most fundamental of these improvements is the class of bounds that replace the
number of rounds $T$ with the loss of the best action $L_T^* = \min_{v \in S} v^\top L_T$ (where $L_T = \sum_{t=1}^T \ell_t$
is the cumulative loss vector), thus guaranteeing a regret
of $\tilde O(\sqrt{L_T^*})$. Such improved bounds, often called first-order regret bounds, are abundant in the
online learning literature when assuming full feedback: the key for obtaining such results is usually
a clever tuning rule for otherwise standard learning algorithms such as Hedge (Cesa-Bianchi et al.,
2005; Cesa-Bianchi and Lugosi, 2006) or FPL (Hutter and Poland, 2004; Kalai and Vempala, 2005;
Van Erven et al., 2014). The intuitive advantage of such first-order bounds is that they can effectively
take advantage of “easy” learning problems where there exists an action with superior performance. 
In our cognitive-radio example, this corresponds to the existence of a user-channel matching that 
tends to provide high quality of service. 

One obvious question is whether such improvements are possible under partial-information constraints.
We can answer this question in the positive, although such bounds are far less common
than in the full-information case. In fact, we are only aware of three algorithms that achieve such
bounds: Exp3Light described in Section 4.4 of Stoltz (2005), Green by Allenberg et al. (2006)
and SCRiBLe by Abernethy et al. (2012), as shown by Rakhlin and Sridharan (2013)^1. These algorithms
guarantee regret bounds of $O(d\sqrt{L_T^* \log d})$, $O(\sqrt{dL_T^* \log d})$ and $O(d^{3/2}\sqrt{L_T^* \log(dT)})$
in the multi-armed bandit problem, respectively. These results, however, either do not generalize
to the combinatorial setting (see Section 2 for a discussion on Green) or already scale poorly 
with the problem size in the simplest partial-information setting. Furthermore, implementing these 
algorithms is also not straightforward for combinatorial decision sets. 

In this paper, we propose a computationally efficient algorithm that guarantees similar improvements
for combinatorial semi-bandits. Our approach is based on the Follow-the-Perturbed-Leader
(FPL) algorithm of Hannan (1957), as popularized by Kalai and Vempala (2005). We show that an
appropriately tuned variant of our algorithm guarantees a regret bound of $O(m\sqrt{dL_T^* \log(d/m)})$,
largely improving on the minimax-optimal bounds whenever $L_T^* = o(T)$. In the case of multi-armed
bandits where $m = 1$, the bound becomes $O(\sqrt{dL_T^* \log d})$. Notice however that when
$m > 1$, $L_T^*$ can be as large as $\Omega(mT)$ in the worst case, making our bounds inferior to the best
known bounds concerning FPL and Exp3. To circumvent this problem, as well as the need to know
a bound on $L_T^*$ to tune our parameters, we also propose an adaptive variant of our algorithm that
guarantees a regret of $O(m\sqrt{\min\{dL_T^*, dT\} \log(d/m)})$. Thus, our performance guarantees are in
some sense the strongest among known results for non-stochastic combinatorial semi-bandits.

Besides first-order bounds, there are several other known ways of improving worst-case performance
guarantees of $\tilde O(\sqrt{T})$ for non-stochastic multi-armed bandits. A common improvement is replacing
$T$ by the gain of the best action, $T - L_T^*$ (see, e.g., Auer et al., 2002a; Audibert and Bubeck,
2009). Such bounds, while helpful in some cases where all actions tend to suffer large losses (e.g., in
online advertising where even the best ads have low clickthrough rates), are not as satisfactory as our
bounds: these bounds get worse and worse as one keeps increasing the gain of the best action, even if
all other losses are kept constant, despite the intuition that this operation actually makes the learning
problem much easier. That is, bounds of the above type fail to reflect the “hardness” of the learning
problem at hand. The work of Hazan and Kale (2011) considers a much more valuable type of improvement:
they provide regret bounds of $\tilde O(d^2\sqrt{Q_T})$, where $Q_T = \min_{\mu \in \mathbb{R}^d} \sum_{t=1}^T \|\ell_t - \mu\|_2^2$ is
the quadratic variation of the losses. Such bounds are very strong in situations where the sequence
of loss vectors “stays close” to its mean in all rounds. Notice however that, unlike our first-order
bounds, this improvement requires a condition to hold for entire loss vectors and not just the loss of
the best action. This implies that first-order bounds are more robust to loss variations of obviously
suboptimal actions. On the other hand, it is also easy to construct an example where $L_T^*$ grows
linearly while $Q_T$ is zero. In summary, we conclude that first-order bounds and bounds depending
on the quadratic variation are not comparable in general, as they capture very different kinds of 
regularities in the loss sequences. For further discussion of higher-order and variation-dependent 
regret bounds, see Cesa-Bianchi et al. (2005) and Hazan and Kale (2010). We also mention that 
several other types of improvements exist for full-information settings—we refer to recent works of 
Rakhlin and Sridharan (2013), Sani et al. (2014) and the references therein. 

Finally, let us comment on related work on the so-called stochastic bandit setting where the
loss vectors are drawn i.i.d. in every round. In this setting, combinatorial semi-bandits have been
studied under the name “combinatorial bandits” (Gai et al., 2012; Chen et al., 2013), giving rise to
a bit of confusion^2. This line of work focuses on proving bounds on the pseudo-regret defined as
$\max_{v \in S} \mathbb{E}\left[\sum_{t=1}^T (V_t - v)^\top \mu\right]$, where $\mu \in \mathbb{R}^d$ is the mean of the random vector $\ell_1$. We highlight the
result of Kveton et al. (2015), who have very recently proposed an algorithm that guarantees bounds
on the pseudo-regret of $O(md(1/\Delta) \log T)$ for some distribution-dependent constant $\Delta > 0$ and a
worst-case bound of $O(\sqrt{mdT \log T})$. Note however that comparing these pseudo-regret bounds to
bounds on the expected regret can be rather misleading. In fact, a simple argument along the lines
of Section 9 of Audibert and Bubeck (2010) shows that even algorithms with zero pseudo-regret
can actually suffer an expected regret of $\Omega(\sqrt{T})$ when permitting multiple optimal actions. A more
refined argument shows that this bound can be tightened to $\Omega(\sqrt{L_T^*})$ when assuming non-negative
losses, suggesting that first-order bounds on the expected regret are in some sense unbeatable even
in a distribution-dependent setting^3.

1. The obscure nature of such first-order bounds is reflected by the fact that Rakhlin and Sridharan prove their
corresponding result simply because they were not aware of the two previous results.


2. From zero-order to first-order bounds: Keeping the loss estimates close together 

We now explain the key idea underlying our analysis. Our approach is based on the observation
that regret bounds for many known bandit algorithms (such as Exp3 by Auer et al. 2002a,
OSMD with relative-entropy regularization by Audibert et al. 2014, and the bandit FPL analysis of
Neu and Bartok 2013) take the form

$$\eta \sum_{t=1}^T \sum_{i=1}^d \hat\ell_{t,i} + \frac{D}{\eta} \le \eta \sum_{i=1}^d \hat L_{T,i} + \frac{D}{\eta}, \qquad (1)$$

where $\hat\ell_{t,i}$ is an estimate of the loss $\ell_{t,i}$, $\hat L_{T,i} = \sum_{t=1}^T \hat\ell_{t,i}$, $\eta > 0$ is a tuning parameter, and $D > 0$
is a constant that depends on the particular algorithm and the decision set. The standard approach is
then to design the loss estimates to be unbiased so that the above bound becomes $\eta \sum_{i=1}^d L_{T,i} + \frac{D}{\eta}$
after taking expectations. Unfortunately, this form does not permit proving first-order bounds as
$L_{T,i}$ may very well be $\Omega(T)$ for some $i$ even in very easy problem instances; that is, even an optimized
setting of $\eta$ gives a regret bound of $O(\sqrt{dDT})$ at best. Applying a similar line of reasoning, one can
replace $T$ in the above bound by $(T - L_T^*)$, the largest total gain associated with any component,
but, as already discussed in the introduction, this improvement is not useful for our purposes.

In this paper, we take a different approach to optimize bounds of the form (1). The idea is
to construct a loss-estimation scheme that keeps every $\hat L_{T,i}$ “close” to $\hat L_T^* = \min_{v \in S} v^\top \hat L_T$, the
estimated loss of the optimal action, in the sense that

$$\hat L_{T,i} \le \hat L_T^* + \tilde O\left(\frac{1}{\eta}\right). \qquad (2)$$

Observe that this property allows rewriting the bound (1) as $\eta d \hat L_T^* + \tilde O(1/\eta)$. Of course, a
loss-estimation scheme guaranteeing the above property has to come at the price of a certain bias.
Guaranteeing that the bias satisfies certain properties and is optimistic in the sense that $\mathbb{E}[\hat L_T^*] \le L_T^*$,
we can arrive at a first-order bound by choosing $\eta = \tilde\Theta(\sqrt{1/L_T^*})$. The remaining challenge is to
come up with an adaptive learning-rate schedule that achieves such a bound without prior knowledge
of $L_T^*$.

2. The term “combinatorial bandits” was first used by Cesa-Bianchi and Lugosi (2009), in reference to online
combinatorial optimization problems under full bandit feedback where the learner only observes $V_t^\top \ell_t$ after round $t$.

3. Hazan and Kale (2010) use a similar argument to show that variation-dependent bounds are unbeatable for signed
losses in a similar sense.

Our approach is not without a precedent: Allenberg et al. (2006) derive a first-order bound for 
multi-armed bandits based on very similar principles. Their algorithm, called Green, relies on a 
clever trick that prevents picking arms that seem suboptimal. Specifically, Green maintains a set
of weights $w_{t,i}$ over the arms and computes an auxiliary probability distribution $\tilde p_{t,i} \propto w_{t,i}$. The
true sampling distribution over the arms is computed by setting $p_{t,i} = 0$ for all arms such that $\tilde p_{t,i}$ is
below a certain threshold $\gamma$, and then redistributing the removed weight among the remaining arms
proportionally to $w_{t,i}$. The intuitive effect of this thresholding operation is that poorly performing
arms are eliminated, which restrains the further growth of their respective estimated losses.
Specifically, Allenberg et al. show that property (2) and $\mathbb{E}[\hat L_T^*] \le L_T^*$ simultaneously hold for their
algorithm, paving the way for their first-order bound.

While providing strong technical results, Allenberg et al. (2006) give little intuition as to why
this approach is key to obtaining first-order bounds and how to generalize their algorithm to more
complicated problem settings such as ours. Even if one is able to come up with a generalization on
a conceptual level, efficient implementation of such a variant would only be possible on a handful
of decision sets where Exp3 can be implemented in the first place (see, e.g., Koolen et al., 2010;
Cesa-Bianchi and Lugosi, 2012). The probabilistic nature of the approach of Allenberg et al. does
not seem to mix well with the mirror-descent type algorithms of Audibert et al. (2014) either, whose
proofs rely on tools from convex analysis. In the current paper, we propose an alternative way to
restrict sampling of suboptimal actions that leads to property (2) in a much more transparent and
intuitive way.


3. The algorithm: FPL with truncated perturbations and implicit exploration 

Our algorithm is a variant of the well-known Follow-the-Perturbed-Leader (FPL) learning algorithm
(Hannan, 1957; Kalai and Vempala, 2005; Hutter and Poland, 2004; Neu and Bartok, 2013),
equipped with a perturbation scheme that will enable us to prove first-order bounds through
guaranteeing property (2). In every round $t$, FPL chooses its action as

$$V_t = \arg\min_{v \in S} v^\top \left(\eta_t \hat L_{t-1} - Z_t\right), \qquad (3)$$

where $\eta_t > 0$ is a parameter of the algorithm, $\hat L_{t-1}$ is a vector serving as an estimate of the cumulative
loss vector $L_{t-1} = \sum_{s=1}^{t-1} \ell_s$ and $Z_t \in \mathbb{R}^d$ is a vector of random perturbations. FPL is very
well-studied in the full-information case where one can choose $\hat L_{t-1} = L_{t-1}$; several perturbation
schemes are known to work well in this setting (Kalai and Vempala, 2005; Rakhlin et al., 2012;
Devroye et al., 2013; Van Erven et al., 2014; Abernethy et al., 2014). In what follows, we focus on
exponentially distributed perturbations, which is the only scheme known to achieve near-optimal
performance guarantees under bandit feedback (Poland, 2005; Neu and Bartok, 2013).

In order to guarantee that the condition (2) is satisfied, we propose to suppress suboptimal
actions by using bounded-support perturbations. Specifically, we propose to use a truncated
exponential distribution with the following density function:

$$f_B(z) = \begin{cases} \dfrac{e^{-z}}{1 - e^{-B}}, & \text{if } z \in [0, B], \\ 0 & \text{otherwise.} \end{cases}$$

Here, $B > 0$ is the bound imposed on the perturbations. In each round $t$, our FPL variant draws
components of the perturbation vector $Z_t$ independently from an exponential distribution truncated
at $B_t > 0$, another tuning parameter of our algorithm. To define our loss estimates, let us define
$q_{t,i} = \mathbb{E}[V_{t,i} \mid \mathcal{F}_{t-1}]$ and the vector $\hat\ell_t$ with components

$$\hat\ell_{t,i} = \frac{\ell_{t,i} V_{t,i}}{q_{t,i} + \gamma_t}, \qquad (4)$$

where $\gamma_t > 0$ is the so-called implicit exploration (or IX) parameter of the algorithm controlling the
bias of the loss estimates. Notice that $\mathbb{E}[\hat\ell_{t,i} \mid \mathcal{F}_{t-1}] \le \ell_{t,i}$ holds by construction for all $i$. Then, $\hat L_t$ is simply
defined as $\hat L_t = \sum_{s=1}^t \hat\ell_s$. In what follows, we refer to our algorithm as FPL-TrIX, standing
for “FPL with truncated perturbations and implicit exploration”. Pseudocode for FPL-TrIX is
presented as Algorithm 1.
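
The two ingredients introduced above can be sketched in a few lines. The snippet assumes that the truncated density is $e^{-z}/(1 - e^{-B})$ on $[0, B]$ and that the loss estimate has the implicit-exploration form (loss times indicator, divided by $q + \gamma$); the function names `sample_truncated_exp` and `ix_estimate` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def sample_truncated_exp(B, size, rng):
    """Draw samples with density e^{-z} / (1 - e^{-B}) on [0, B] by inverse-CDF sampling."""
    u = rng.random(size)                         # uniform on [0, 1)
    # Inverts the CDF F(z) = (1 - e^{-z}) / (1 - e^{-B}) for z in [0, B].
    return -np.log1p(-u * (1.0 - np.exp(-B)))

def ix_estimate(loss, played, q, gamma):
    """Componentwise implicit-exploration estimate: loss * played / (q + gamma)."""
    return loss * played / (q + gamma)
```

Under these assumptions the conditional mean of the estimate is $q \ell / (q + \gamma)$, which never exceeds $\ell$: this is the optimistic bias that the parameter $\gamma$ controls.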


Algorithm 1 FPL-TrIX

Parameters: Learning rates $(\eta_t)$, implicit exploration parameters $(\gamma_t)$, truncation parameters $(B_t)$.

Initialization: $\hat L_0 = 0$.

For $t = 1, 2, \dots, T$, repeat

1. Draw perturbation vector $Z_t$ with independent components $Z_{t,i} \sim f_{B_t}$.

2. Play action

$$V_t = \arg\min_{v \in S} v^\top \left(\eta_t \hat L_{t-1} - Z_t\right).$$

3. For all $i$, observe losses $\ell_{t,i} V_{t,i}$ and compute $\hat\ell_{t,i} = \frac{\ell_{t,i} V_{t,i}}{q_{t,i} + \gamma_t}$.

4. Set $\hat L_t = \hat L_{t-1} + \hat\ell_t$.
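
The steps of Algorithm 1 can be sketched as follows for one particular decision set. This is a minimal illustration under several assumptions that are not taken from the paper: the decision set is the collection of all $m$-subsets (so the linear-optimization oracle just picks the $m$ smallest scores), the parameters are constant with $\gamma = \eta m$ and $e^{-B} = \eta m / d$, and the probabilities $q_{t,i}$ are replaced by a plain Monte Carlo estimate from fresh perturbations. The name `fpl_trix` and the argument `n_mc` are hypothetical.

```python
import numpy as np

def sample_truncated_exp(B, size, rng):
    """Inverse-CDF sampling from the density e^{-z} / (1 - e^{-B}) on [0, B]."""
    u = rng.random(size)
    return -np.log1p(-u * (1.0 - np.exp(-B)))

def fpl_trix(losses, m, eta, rng, n_mc=1000):
    """Run the sketch on a (T, d) loss matrix; returns (total loss, final loss estimates)."""
    T, d = losses.shape
    gamma = eta * m            # implicit exploration parameter (constant tuning)
    beta = eta * m / d         # beta = e^{-B}, chosen so that beta * d = gamma
    assert beta < 1.0, "need eta < d / m so that the truncation level B is positive"
    B = -np.log(beta)
    L_hat = np.zeros(d)
    total_loss = 0.0
    for t in range(T):
        # Steps 1-2: perturb the estimated cumulative losses and call the oracle,
        # which for m-sets simply selects the m smallest perturbed scores.
        Z = sample_truncated_exp(B, d, rng)
        V = np.zeros(d)
        V[np.argsort(eta * L_hat - Z)[:m]] = 1.0
        total_loss += float(V @ losses[t])
        # Monte Carlo stand-in for q_{t,i} = P[V_{t,i} = 1 | history] (an assumption).
        scores = eta * L_hat - sample_truncated_exp(B, (n_mc, d), rng)
        chosen = np.argpartition(scores, m - 1, axis=1)[:, :m]
        q = np.bincount(chosen.ravel(), minlength=d) / n_mc
        # Steps 3-4: implicit-exploration estimates, using only the observed components.
        L_hat += losses[t] * V / (q + gamma)
    return total_loss, L_hat
```

Because the perturbations are bounded by $B$, a component whose estimated loss has drifted far enough above that of the best action can no longer be selected, so its estimate stops growing; this is the mechanism behind property (2).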


It will also be useful to introduce the notations $D = \log(d/m) + 1$ and $\beta_t = e^{-B_t}$. For technical
reasons, we are going to assume that the sequences of learning rates $(\eta_t)_t$, exploration parameters
$(\gamma_t)_t$ and truncation parameters $(\beta_t)_t$ are all nonincreasing.

Before proceeding, a few comments are in order. First, note that the probabilities $q_{t,i}$ are generally
not efficiently computable in closed form. This issue can be circumvented by the simple
and efficient loss-estimation method proposed by Neu and Bartok (2013) that produces equivalent
estimates in expectation; we resort to the loss estimates (4) to preserve clarity of presentation.
Otherwise, similarly to other FPL-based methods, FPL-TrIX can be efficiently implemented as long as
the learner has access to an efficient linear-optimization oracle over $S$. Second, we remark that loss
estimates of the form (4) were first proposed by Kocak et al. (2014) as an effective way to trade off
the bias and variance of importance-weighted estimates. Finally, one may ask if the truncations we
introduce are essential for our algorithm to work. Answering this question requires a little deeper
technical understanding of FPL-TrIX than the reader might have at this point, and thus we defer
this discussion to Section 6. (For the impatient reader, the short answer is that one can get away
without truncations at the price of an additive $O(\log T)$ term in the bounds. Note however that the
proof of this result still relies on the analysis of FPL-TrIX that we present in this paper.)
3.1. Some properties of FPL-TrIX

In this section, we present some key properties of our algorithm. We first relate the predictions
of FPL-TrIX to those of an FPL instance that employs standard (non-truncated) exponential
perturbations. Specifically, we study the relation between the expected performance of FPL-TrIX
that selects the action sequence $(V_t)$ and an auxiliary algorithm that uses a fixed exponentially
distributed perturbation vector $\tilde Z$, and plays

$$\tilde V_t = \arg\min_{v \in S} v^\top \left(\eta_t \hat L_{t-1} - \tilde Z\right) \qquad (5)$$

in round $t$. In particular, we are interested in the relation between the quantities

$$p_t(v) = \mathbb{P}[V_t = v \mid \mathcal{F}_{t-1}], \quad \tilde p_t(v) = \mathbb{P}[\tilde V_t = v \mid \mathcal{F}_{t-1}], \quad q_{t,i} = \mathbb{P}[V_{t,i} = 1 \mid \mathcal{F}_{t-1}], \quad \tilde q_{t,i} = \mathbb{P}[\tilde V_{t,i} = 1 \mid \mathcal{F}_{t-1}]$$

defined for all $t$, $i$ and $v$. The following lemma establishes a bound on the total variation distance
between the distributions induced by $Z$ and $\tilde Z$, and thus relates the above quantities to each other.

Lemma 1 Let the components of $Z$ and $\tilde Z$ be drawn independently from $f_{B_t}$ and $f_\infty$, respectively.
Then, for any function $G : \mathbb{R}^d \to [0, 1]$, we have $|\mathbb{E}G(Z) - \mathbb{E}G(\tilde Z)| \le \beta_t d$. In particular, this
implies that $|p_t(v) - \tilde p_t(v)| \le \beta_t d$ for all $t$ and $v$ and $|q_{t,i} - \tilde q_{t,i}| \le \beta_t d$ for all $t$ and $i$.

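Proof For ease of notation, define $f = f_\infty$, $g = \mathbb{E}G(Z)$ and $\tilde g = \mathbb{E}G(\tilde Z)$. We first prove
$g \le \tilde g + \beta_t d$. To this end, observe that by the definition of $f_{B_t}$,

$$g = \int_{z \in [0, B_t]^d} G(z) f_{B_t}(z)\,dz \le \frac{1}{(1 - e^{-B_t})^d} \int_{z \in [0, \infty)^d} G(z) f(z)\,dz = \frac{\tilde g}{(1 - e^{-B_t})^d}.$$

After reordering and using the inequality $(1 - x)^d \ge 1 - dx$ that holds for all $x \le 1$ and all $d \ge 1$,
we obtain $g(1 - \beta_t d) \le \tilde g$. The upper bound on $g$ follows from reordering again and using $g \le 1$.
To prove the lower bound on $g$, we can use a similar argument as

$$g = \int_{z \in [0, B_t]^d} G(z) f_{B_t}(z)\,dz = \frac{1}{(1 - e^{-B_t})^d} \int_{z \in [0, B_t]^d} G(z) f(z)\,dz \ge \frac{1}{(1 - e^{-B_t})^d} \left(\tilde g - \int_{z \notin [0, B_t]^d} f(z)\,dz\right) = \frac{\tilde g}{(1 - e^{-B_t})^d} - \frac{1 - (1 - e^{-B_t})^d}{(1 - e^{-B_t})^d}.$$

After reordering and using $(1 - x)^d \ge 1 - dx$ again, we obtain

$$\tilde g \le g (1 - e^{-B_t})^d + \left(1 - (1 - e^{-B_t})^d\right) \le g + \beta_t d,$$

concluding the proof.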
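
The statement of Lemma 1 can be sanity-checked on a test function for which both expectations are available in closed form. The sketch below assumes the reading $\beta = e^{-B}$ and uses the indicator $G(z) = \mathbb{1}\{\max_i z_i \le c\}$, which is just one convenient choice; the function name `expectation_gap` is hypothetical.

```python
import math

def expectation_gap(d, B, c):
    """|E G(Z) - E G(Z~)| for G(z) = 1{max_i z_i <= c}, with independent components.

    Z has components truncated-exponential on [0, B]; Z~ has unit exponential components.
    """
    truncated_cdf = min(1.0, (1.0 - math.exp(-c)) / (1.0 - math.exp(-B)))
    exponential_cdf = 1.0 - math.exp(-c)
    return abs(truncated_cdf ** d - exponential_cdf ** d)

def lemma1_bound(d, B):
    """The bound beta * d with beta = e^{-B}."""
    return d * math.exp(-B)
```

For $d = 1$ and $c = B$ the gap equals $e^{-B}$ exactly, so the bound cannot be improved in general for this choice of $G$.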


The other important property of FPL-TrIX that we highlight in this section is that the loss 
estimates generated by the algorithm indeed satisfy property ( 2 ). 
Lemma 2 Assume that the sequences $(\eta_t)$, $(\gamma_t)$ and $(\beta_t)$ are nonincreasing. Then for any $i$ and
$v \in S$, we have

$$\hat L_{T,i} \le v^\top \hat L_T + \frac{m(D + B_T)}{\eta_T} + \frac{1}{\gamma_T}.$$

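Proof Fix an arbitrary $i$ and $v$ and let $\tau$ denote the last round in which $q_{t,i} > 0$. This entails that
$\hat L_{T,i} = \hat L_{\tau,i}$ holds almost surely, as $V_{t,i} = 0$ for all $t > \tau$. By the construction of the algorithm and
the perturbations, $q_{\tau,i} > 0$ implies that there exists a $w$ with $w_i = 1$ and $p_\tau(w) > 0$. Thus,

$$w^\top \hat L_{\tau-1} \le \min_{u \in S} u^\top \hat L_{\tau-1} + \frac{B_\tau m}{\eta_\tau} \le \tilde V_\tau^\top \hat L_{\tau-1} + \frac{B_T m}{\eta_T} = \tilde V_\tau^\top \left(\hat L_{\tau-1} - \frac{1}{\eta_\tau} \tilde Z\right) + \frac{1}{\eta_\tau} \tilde V_\tau^\top \tilde Z + \frac{B_T m}{\eta_T} \le v^\top \left(\hat L_{\tau-1} - \frac{1}{\eta_\tau} \tilde Z\right) + \frac{\tilde V_\tau^\top \tilde Z + B_T m}{\eta_T},$$

where the first inequality follows from the fact that $p_\tau(w) > 0$, the second one follows from
$B_\tau/\eta_\tau \le B_T/\eta_T$ and the last one from the definition of $\tilde V_\tau$ and $\eta_\tau \ge \eta_T$. After integrating both sides with
respect to the distribution of $\tilde Z$ and bounding $\hat\ell_{\tau,i} \le 1/\gamma_\tau \le 1/\gamma_T$, we obtain the result as

$$\hat L_{T,i} = \hat L_{\tau-1,i} + \hat\ell_{\tau,i} \le v^\top \hat L_{\tau-1} + \frac{m(D + B_T)}{\eta_T} + \frac{1}{\gamma_T} \le v^\top \hat L_T + \frac{m(D + B_T)}{\eta_T} + \frac{1}{\gamma_T},$$

where we used the fact that $\hat L_{t,j}$ is nonnegative for all $j$, $w_i = 1$, and $\mathbb{E}[\tilde V_\tau^\top \tilde Z] \le m(\log(d/m) + 1) = mD$,
which follows from Lemma 10 stated and proved in the Appendix. ■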


4. Regret bounds 

This section presents our main results concerning the performance of FPL-TrIX under various 
parameter settings. We begin by stating a key theorem. 

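Theorem 3 Assume that the sequences $(\eta_t)$, $(\gamma_t)$ and $(\beta_t)$ are nonincreasing and $\beta_t d \le \gamma_t$ holds
for all $t$. Then for all $v \in S$, the total loss suffered by FPL-TrIX satisfies

$$\sum_{t=1}^T V_t^\top \ell_t \le v^\top \hat L_T + \frac{mD}{\eta_T} + \sum_{t=1}^T (\eta_t m + \beta_t d + \gamma_t) \sum_{i=1}^d \hat\ell_{t,i}.$$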

The proof of the theorem is deferred to Section 5. Armed with this theorem, we are now ready to 
prove our first main result: a first-order bound on the expected regret of FPL-TrIX. 

Corollary 4 Consider FPL-TrIX run with the time-independent parameters $\gamma = \eta m$ and $\beta d = \gamma$
(and thus $B = \log(d/m) - \log \eta$). The expected regret of the resulting algorithm satisfies

$$\mathbb{E}[R_T] \le \frac{mD}{\eta} + 3\eta m d L_T^* + 3m^2 d (D + B) + 3d.$$

In particular, setting $\eta = \min\left\{1, \sqrt{D/(3dL_T^*)}\right\}$ guarantees

$$\mathbb{E}[R_T] \le 5.2 m \sqrt{dL_T^* (\log(d/m) + 1)} + 1.5 m^2 d \max\{\log(dL_T^*), 0\} + O\left(m^2 d \log(d/m)\right).$$
Proof Let $v_* = \arg\min_{v \in S} v^\top L_T$. The proof of the first statement follows directly from
combining the bounds of Theorem 3 and Lemma 2 for $v = v_*$, taking expectations and noticing
that $\mathbb{E}[v_*^\top \hat L_T] \le L_T^*$. For the second statement, first consider the case when $\eta = 1$ and thus
$\beta = \eta(m/d) = m/d$, giving $B = \log(1/\beta) = \log(d/m)$. Now notice that the setting of $\eta$ implies
$L_T^* \le (3D)/d$ and thus $L_T^* \le \sqrt{3 L_T^* D / d}$. Then, substituting the value of $\eta$ into the first bound
of the corollary gives

$$\mathbb{E}[R_T] \le 3m\sqrt{3dL_T^* D} + mD + 3m^2 d (2\log(d/m) + 1) + 3d,$$

proving the statement as $3\sqrt{3} < 5.2$. For the case $\eta < 1$, the bound follows from substituting the
value of $\eta$ as

$$\mathbb{E}[R_T] \le 2m\sqrt{3dL_T^* D} + \frac{3}{2} m^2 d \log(dL_T^*) + 3m^2 d (2\log(d/m) + 1) + 3d,$$

where we used that $B = \log(d/m) + \log(1/\eta)$ and $\log(1/\eta) \le \log(dL_T^*)/2$. ■
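
To see how the tuning in Corollary 4 balances the two terms that depend on the learning rate, the following sketch evaluates the first bound of the corollary numerically. It assumes the bound has the form $mD/\eta + 3\eta m d L_T^* + 3m^2 d(D + B) + 3d$ and that the learning rate is $\min\{1, \sqrt{D/(3 d L_T^*)}\}$; the function names are hypothetical.

```python
import math

def corollary4_bound(eta, m, d, L_star):
    """Evaluate mD/eta + 3*eta*m*d*L* + 3*m^2*d*(D + B) + 3d with B = log(d/m) - log(eta)."""
    D = math.log(d / m) + 1.0
    B = math.log(d / m) - math.log(eta)
    return m * D / eta + 3.0 * eta * m * d * L_star + 3.0 * m**2 * d * (D + B) + 3.0 * d

def tuned_eta(m, d, L_star):
    """Learning rate min{1, sqrt(D / (3 d L*))}; returns 1 when L* is zero."""
    D = math.log(d / m) + 1.0
    if L_star <= 0.0:
        return 1.0
    return min(1.0, math.sqrt(D / (3.0 * d * L_star)))
```

With this choice and a learning rate below one, the two leading terms are equal and sum to $2m\sqrt{3 d L_T^* D}$, which is where the $\sqrt{L_T^*}$ scaling of the bound comes from.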

Notice that achieving the above bounds requires perfect knowledge of $L_T^*$, which is usually not
available in practice. While one could use a standard doubling trick to overcome this difficulty, we
choose to take a different path to circumvent this issue, and propose a modified version of FPL-TrIX
that is able to tune its learning rate and other parameters solely based on observations. We
note that our tuning rule has some unorthodox qualities and might be of independent interest.

Similarly to the parameter choice suggested by Corollary 4, we will use a single sequence of
decreasing non-negative learning rates $(\eta_t)$ and set $\gamma_t = m\eta_t$ and $\beta_t = (m/d)\eta_t$ for all $t$. For
simplicity, let us define the notations $s_t = \sum_{i=1}^d \hat\ell_{t,i}$ and $S_t = \sum_{k=0}^t s_k$, with $s_0 = \frac{1}{D} > 0$.
With these notations, we define our tuning rule as

$$\eta_t = \sqrt{\frac{D}{S_{t-1}}}. \qquad (6)$$

Notice that $\eta_1 = D$, and thus $\beta_1 = \eta_1 (m/d) = (m/d)(\log(d/m) + 1) \le 1$, ensuring that $B_1 \ge 0$
and the algorithm is well-defined. This follows from the inequality $z(1 - \log z) \le 1$ that holds for
all $z \in (0, 1)$. The delicacy of the tuning rule (6) is that the terms $s_t$ are themselves bounded in
terms of the random quantity $1/\eta_t$, and not some problem-dependent constant. To the best of our
knowledge, all previously known analyses concerning adaptive learning rates apply a deterministic
bound on $s_t$ at some point, largely simplifying the analysis. As we will see below, treating this issue
requires a bit more care than usual. The following theorem presents the performance guarantees of
the resulting variant of FPL-TrIX.
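
The adaptive schedule can be sketched as follows. The code assumes that the tuning rule reads $\eta_t = \sqrt{D/S_{t-1}}$ with $S_0 = 1/D$; this form is an assumption chosen to be consistent with $\eta_1 = D$ stated above, and the function name `adaptive_rates` is hypothetical.

```python
import math

def adaptive_rates(s, d, m):
    """Return ([eta_1, ..., eta_T], [S_0, ..., S_T]) for observed sums s_t = sum_i hat-l_{t,i}.

    Assumes eta_t = sqrt(D / S_{t-1}) with S_0 = 1 / D and D = log(d/m) + 1, so that
    eta_t depends only on the estimates observed before round t.
    """
    D = math.log(d / m) + 1.0
    S = [1.0 / D]
    etas = []
    for s_t in s:
        etas.append(math.sqrt(D / S[-1]))  # learning rate used in round t
        S.append(S[-1] + s_t)              # update after observing the round-t estimates
    return etas, S
```

The corresponding exploration and truncation parameters would then be $\gamma_t = m\eta_t$ and $\beta_t = (m/d)\eta_t$, so a single sequence drives all three.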

Theorem 5 The regret of FPL-TrIX with the adaptive learning rates defined in Equation (6)
simultaneously satisfies

$$\mathbb{E}[R_T] \le 13 m \sqrt{dL_T^* (\log(d/m) + 1)} + O\left(m^2 d \log(dT)\right)$$

and

$$\mathbb{E}[R_T] \le 13 m \sqrt{dT (\log(d/m) + 1)} + 9.49 m.$$
Neu


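Proof Let $v_* = \arg\min_{v \in S} v^\top L_T$. First, notice that the learning-rate sequence defined by
Equation (6) is nonincreasing as required by Theorem 3. Also note that $s_t$ is nonnegative and is bounded
by $\frac{1}{\eta_t} = \sqrt{S_{t-1}/D}$ for all $t$, and $\sqrt{S_{t-1}/D} \le S_{t-1}$ holds since $S_{t-1} \ge \frac{1}{D}$ for all $t$. These facts
together imply that $\eta_t \le \sqrt{2D/S_t}$ as

$$\eta_t = \sqrt{\frac{D}{S_{t-1}}} = \sqrt{\frac{2D}{S_{t-1} + S_{t-1}}} \le \sqrt{\frac{2D}{S_{t-1} + s_t}} = \sqrt{\frac{2D}{S_t}}.$$

Combining the above bound with Lemma 3.5 of Auer et al. (2002b), we get

$$\sum_{t=1}^T \eta_t s_t \le \sqrt{2D} \sum_{t=1}^T \frac{s_t}{\sqrt{S_t}} \le 2\sqrt{2D S_T}.$$

Using $\eta_T \le \sqrt{2D/S_T}$ again, the right-hand side can be further bounded as $2\sqrt{2D S_T} \le \frac{4D}{\eta_T}$ and
the bound of Theorem 3 applied for $v_*$ becomes

$$\sum_{t=1}^T V_t^\top \ell_t - v_*^\top \hat L_T \le 13 m \frac{D}{\eta_T}. \qquad (7)$$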


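Now, we are ready to prove the second bound in the theorem. Notice that
$\frac{D}{\eta_T} = \sqrt{D S_{T-1}} \le \sqrt{D S_T}$ holds by the tuning rule and

$$\mathbb{E}[S_T] = \frac{1}{D} + \sum_{i=1}^d \mathbb{E}\left[\hat L_{T,i}\right] \le 1 + dT, \qquad (8)$$

where we used that $\mathbb{E}\hat\ell_{t,i} \le \ell_{t,i} \le 1$ and $D \ge 1$. The statement then follows from plugging this bound into
Equation (7), taking expectations and using Jensen’s inequality.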

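Proving the first bound requires a bit more care. First, an application of Lemma 2 gives

$$S_T \le \frac{1}{D} + d \left(v_*^\top \hat L_T + \frac{m(B_T + D) + \frac{1}{m}}{\eta_T}\right).$$

Now recall that $\frac{D}{\eta_T} \le \sqrt{D S_T}$ holds by the tuning rule. Bounding $S_T$ as above, this implies

$$\frac{D}{\eta_T} \le \sqrt{1 + dD \left(v_*^\top \hat L_T + \frac{m(B_T + D) + \frac{1}{m}}{\eta_T}\right)}.$$

Solving the resulting quadratic equation for the largest possible value of $1/\eta_T$ gives

$$\frac{D}{\eta_T} \le \sqrt{1 + dD v_*^\top \hat L_T} + md\left(\log(1/\beta_T) + D\right) + \frac{d}{m} \le \sqrt{dD v_*^\top \hat L_T} + md\left(\log(S_T) + 3\log(d/m) + 2\right) + \frac{d}{m} + 1.$$

The first term can be directly bounded by using Jensen’s inequality as
$\mathbb{E}\left[\sqrt{v_*^\top \hat L_T}\right] \le \sqrt{\mathbb{E}\left[v_*^\top \hat L_T\right]} \le \sqrt{v_*^\top L_T} = \sqrt{L_T^*}$.
Finally, we bound $\mathbb{E}[\log(S_T)] \le \log(dT + 1)$ by using the inequality (8). The statement of
the theorem now follows from substituting into Equation (7) and taking expectations. ■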
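
The summation step in the proof above relies on an inequality of the form $\sum_t s_t/\sqrt{S_t} \le 2\sqrt{S_T}$ for nonnegative $s_t$ and partial sums $S_t$ started from a positive $S_0$. The sketch below checks this form numerically; it follows from $\sqrt{S_t} - \sqrt{S_{t-1}} \ge s_t/(2\sqrt{S_t})$, and the function name `self_normalized_sum` is hypothetical.

```python
import math

def self_normalized_sum(increments, s0):
    """Return (sum_t a_t / sqrt(S_t), S_T) where S_t = s0 + a_1 + ... + a_t and s0 > 0."""
    S = s0
    total = 0.0
    for a_t in increments:
        S += a_t
        total += a_t / math.sqrt(S)
    return total, S
```

Telescoping the same argument gives the slightly sharper $2(\sqrt{S_T} - \sqrt{S_0})$, which the test below also verifies.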
5. The proof of Theorem 3

Finally, let us turn to proving our key theorem. For the proof, we recall the auxiliary forecaster
defined in Equation (5) that uses a fixed non-truncated perturbation vector $\tilde Z$ and also define a
variant that is also allowed to peek one step into the future:

$$\tilde V_t = \arg\min_{v \in S} v^\top \left(\eta_t \hat L_{t-1} - \tilde Z\right) \quad \text{and} \quad \tilde V_t^+ = \arg\min_{v \in S} v^\top \left(\eta_t \hat L_t - \tilde Z\right).$$

We will use the notation $\tilde p_t^+(v) = \mathbb{P}\left[\tilde V_t^+ = v \mid \mathcal{F}_t\right]$ for all $v \in S$.

We start with the following two standard statements concerning the performance of the auxiliary 
forecaster (Neu and Bartok, 2013). Note that the first of these lemmas slightly improves on the 
result of Neu and Bartok (2013) in replacing their log d factor by log(d/m). For completeness, we 
provide the proof of this improved bound in the Appendix. 


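Lemma 6 For any $v \in S$,

$$\sum_{t=1}^T \sum_{u \in S} \tilde p_t^+(u) (u - v)^\top \hat\ell_t \le \frac{m(\log(d/m) + 1)}{\eta_T}. \qquad (9)$$

Lemma 7 For all $t$,

$$\sum_{u \in S} \left(\tilde p_t(u) - \tilde p_t^+(u)\right) u^\top \hat\ell_t \le \eta_t \sum_{u \in S} \tilde p_t(u) \left(u^\top \hat\ell_t\right)^2.$$

The following lemma bounds the term on the right-hand side of the above bound.

Lemma 8 Assume that $\beta_t d \le \gamma_t$. Then for all $t$,

$$\sum_{u \in S} \tilde p_t(u) \left(u^\top \hat\ell_t\right)^2 \le m \sum_{j=1}^d \hat\ell_{t,j}.$$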


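Proof The statement is proven as

$$\sum_{u \in S} \tilde p_t(u) \left(u^\top \hat\ell_t\right)^2 = \mathbb{E}\left[\sum_{i=1}^d \sum_{j=1}^d \left(\tilde V_{t,i} \hat\ell_{t,i}\right)\left(\tilde V_{t,j} \hat\ell_{t,j}\right) \,\middle|\, \mathcal{F}_t\right] \le \sum_{i=1}^d \sum_{j=1}^d \frac{\tilde q_{t,i}}{q_{t,i} + \gamma_t} \ell_{t,i} V_{t,i} \hat\ell_{t,j} \le \sum_{i=1}^d \frac{q_{t,i} + \beta_t d}{q_{t,i} + \gamma_t} V_{t,i} \sum_{j=1}^d \hat\ell_{t,j} \le m \sum_{j=1}^d \hat\ell_{t,j},$$

where the first inequality follows from the definitions of $\hat\ell_t$ and $\tilde q_{t,i}$ and bounding $\tilde V_{t,j} \le 1$, the
second one follows from using Lemma 1 (together with $\ell_{t,i} \le 1$) and the last one from $\beta_t d \le \gamma_t$ and $\|V_t\|_1 \le m$. ■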

Our final lemma quantifies the bias of the learner’s estimated losses. 

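Lemma 9 For all $t$,

$$\sum_{u \in S} \tilde p_t(u) u^\top \hat\ell_t \ge V_t^\top \ell_t - (\gamma_t + \beta_t d) \sum_{i=1}^d \hat\ell_{t,i}.$$

Proof First, note that by Lemma 1, we have

$$\sum_{u \in S} \tilde p_t(u) u^\top \hat\ell_t = \sum_{i=1}^d \tilde q_{t,i} \hat\ell_{t,i} \ge \sum_{i=1}^d q_{t,i} \hat\ell_{t,i} - \beta_t d \sum_{i=1}^d \hat\ell_{t,i}.$$

Then, the proof is concluded by observing that

$$\sum_{i=1}^d q_{t,i} \hat\ell_{t,i} = \sum_{i=1}^d (q_{t,i} + \gamma_t) \frac{V_{t,i} \ell_{t,i}}{q_{t,i} + \gamma_t} - \gamma_t \sum_{i=1}^d \frac{V_{t,i} \ell_{t,i}}{q_{t,i} + \gamma_t} = V_t^\top \ell_t - \gamma_t \sum_{i=1}^d \hat\ell_{t,i}.$$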


The statement of Theorem 3 follows from piecing the lemmas together. 

6. Discussion 

We conclude by discussing some implications and possible extensions of our results. 

Why truncate? One might ask whether truncating the perturbations is really necessary for our 
bounds to hold. We now provide an argument that shows that it is possible to achieve similar results 
without explicit truncations, if we accept an additive $O(\log(dT))$ term in our bound. In particular,
consider FPL with non-truncated exponential perturbations. It is easy to see that each perturbation
remains bounded by $B = \log(dT/\delta)$ with probability at least $1 - \delta/(dT)$, so that, by the union bound,
all $dT$ perturbations remain bounded by $B$ with probability at least $1 - \delta$. One can then analyze
FPL under this condition along the same lines as the proof of Corollary 4, the main difference
being that we also have to account for the regret arising from the low-probability event that not all
perturbations are bounded. Bounding the regret in this case by the trivial bound $dT$, this additional
term becomes $\delta dT$. Setting $\delta = \sqrt{dL_T^*}/(dT)$ makes this additional term $\sqrt{dL_T^*}$; however, notice that
this gives $B = O(\log(dT))$, which shows up additively in the bound. A similar argument can be
shown to work for the adaptive version of FPL-TrIX. We note that the implicit exploration induced
by the bias parameter $\gamma$ and other techniques developed in this paper are still essential to prove these
results.
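
The arithmetic behind this argument is short enough to check directly. The sketch assumes the truncation level is read as $B = \log(dT/\delta)$, so that a single unit-exponential perturbation exceeds $B$ with probability $e^{-B} = \delta/(dT)$; the function names are hypothetical.

```python
import math

def truncation_level(d, T, delta):
    """Level B = log(d T / delta) at which one perturbation exceeds B w.p. delta / (d T)."""
    return math.log(d * T / delta)

def failure_probability(d, T, B):
    """Union bound over d*T unit-exponential perturbations: d * T * P[Z > B], capped at 1."""
    return min(1.0, d * T * math.exp(-B))

def extra_regret(d, T, delta):
    """Contribution of the bad event when the regret is bounded trivially by d * T."""
    return delta * d * T
```

With $\delta = \sqrt{dL_T^*}/(dT)$ the extra term equals $\sqrt{dL_T^*}$, while $B$ grows only logarithmically in $dT$ as long as $L_T^*$ is not vanishingly small.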

High-probability bounds. Another interesting question is whether our results can be extended
to hold with high probability. Luckily, it is rather straightforward to extend Corollary 4 to achieve
such a result by replacing $\hat\ell_{t,i}$ with $\tilde\ell_{t,i} = \frac{1}{\omega} \log(1 + \omega \hat\ell_{t,i})$ for an appropriately chosen $\omega > 0$,
as suggested by Audibert and Bubeck (2010). While such a result would also enable us to handle
adaptive environments, it has the same drawback as Corollary 4: it requires perfect knowledge of
$L_T^*$. Proving high-confidence bounds for the adaptive variant of FPL-TrIX, however, is far less
straightforward; we leave this investigation for future work.
Appendix A. Some technical proofs

We first prove a statement regarding the mean of the sum of top m out of d independent exponential 
random variables. 


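Lemma 10 Let $Z_1, Z_2, \dots, Z_d$ be i.i.d. exponential random variables with unit expectation and let
$Z_1^*, Z_2^*, \dots, Z_d^*$ be their permutation such that $Z_1^* \ge Z_2^* \ge \dots \ge Z_d^*$. Then, for any $1 \le m \le d$,

$$\mathbb{E}\left[\sum_{i=1}^m Z_i^*\right] \le m \left(\log\left(\frac{d}{m}\right) + 1\right).$$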


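Proof Let us define $Y = \sum_{i=1}^m Z_i^*$. Then, as $Y$ is nonnegative, we have for any $A \ge 0$ that

$$\mathbb{E}[Y] = \int_0^\infty \mathbb{P}[Y > y]\,dy \le A + \int_A^\infty \mathbb{P}\left[\sum_{i=1}^m Z_i^* > y\right] dy = A + \mathbb{E}\left[\left(\sum_{i=1}^m Z_i^* - A\right)_+\right] \le A + \sum_{i=1}^d \mathbb{E}\left[\left(Z_i - \frac{A}{m}\right)_+\right] = A + d \int_{A/m}^\infty \mathbb{P}[Z_1 > y]\,dy = A + d e^{-A/m},$$

where the last inequality holds because $\sum_{i=1}^m Z_i^* - A = \sum_{i=1}^m (Z_i^* - A/m)$ is at most the sum of the
positive parts $(Z_i - A/m)_+$ over all $d$ variables. Setting $A = m\log(d/m)$ minimizes the
above expression over the real line, thus proving the statement. ■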
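
Lemma 10 can also be checked against the exact value of the expectation. The sketch relies on the standard fact that the $k$-th largest of $d$ i.i.d. unit-exponential variables has mean $\sum_{j=k}^d 1/j$; the function names are hypothetical.

```python
import math

def expected_top_m(d, m):
    """Exact E[sum of the m largest of d i.i.d. Exp(1) variables] via order-statistic means."""
    return sum(sum(1.0 / j for j in range(k, d + 1)) for k in range(1, m + 1))

def lemma10_bound(d, m):
    """The bound m * (log(d / m) + 1)."""
    return m * (math.log(d / m) + 1.0)
```

The exact value simplifies to $m(1 + H_d - H_m)$ with $H_n$ the $n$-th harmonic number, so the bound is tight for $m = d$ and loses only the gap between $H_d - H_m$ and $\log(d/m)$ otherwise.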


With this lemma at hand, we are now ready to prove Lemma 6. 

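Proof [Proof of Lemma 6] To enhance readability, define $\rho_t = 1/\eta_t$ for $t \ge 1$ and $\rho_0 = 0$. We start
by applying the classical follow-the-leader/be-the-leader lemma (see, e.g., Cesa-Bianchi and Lugosi,
2006, Lemma 3.1) to the loss sequence defined as $(\hat\ell_1 - \rho_1 \tilde Z, \hat\ell_2 - (\rho_2 - \rho_1)\tilde Z, \dots, \hat\ell_T - (\rho_T - \rho_{T-1})\tilde Z)$
to obtain

$$\sum_{t=1}^T \left(\tilde V_t^+\right)^\top \left(\hat\ell_t - (\rho_t - \rho_{t-1}) \tilde Z\right) \le v^\top \left(\hat L_T - \rho_T \tilde Z\right).$$

After reordering and observing that $-v^\top \tilde Z \le 0$, we get

$$\sum_{t=1}^T \left(\tilde V_t^+ - v\right)^\top \hat\ell_t \le \sum_{t=1}^T (\rho_t - \rho_{t-1}) \left(\tilde V_t^+\right)^\top \tilde Z \le \sum_{t=1}^T (\rho_t - \rho_{t-1}) \cdot \max_{u \in S} u^\top \tilde Z = \rho_T \cdot \max_{u \in S} u^\top \tilde Z,$$

where we used that the sequence $(\rho_t)$ is nondecreasing and $u^\top \tilde Z \ge 0$ for all $u \in S$. The result
follows from integrating both sides with respect to the distribution of $\tilde Z$ and applying Lemma 10 to
obtain $\mathbb{E}\left[\max_{u \in S} u^\top \tilde Z\right] \le m(\log(d/m) + 1)$.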
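
The be-the-leader inequality used at the start of this proof states that a forecaster playing the leader of the cumulative costs, including the current round, never does worse than the best fixed action. The sketch below checks it by brute force on a small decision set; the set of $m$-subsets and the function names are assumptions made for illustration.

```python
import itertools
import numpy as np

def m_sets(d, m):
    """All binary vectors in {0,1}^d with exactly m ones (a hypothetical decision set)."""
    actions = []
    for idx in itertools.combinations(range(d), m):
        v = np.zeros(d)
        v[list(idx)] = 1.0
        actions.append(v)
    return np.array(actions)

def be_the_leader_gap(costs, S):
    """sum_t u_t^T c_t - min_v v^T sum_t c_t, where u_t minimizes v^T (c_1 + ... + c_t)."""
    cumulative = np.zeros(costs.shape[1])
    leader_cost = 0.0
    for c in costs:
        cumulative += c
        leader = S[np.argmin(S @ cumulative)]  # leader that already sees the current cost
        leader_cost += float(leader @ c)
    return leader_cost - float(np.min(S @ cumulative))
```

The gap is never positive, and this holds for arbitrary real-valued cost vectors, which is what allows the lemma to be applied to the perturbed sequence with negative terms.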